Split File by Content (Text Processing)

Synopsis

Segments documents based on regular expressions, XPath, or simple string matching.

Description

This operator extracts segments from a set of text documents in a directory using regular expressions, XPath, or simple string matching. It supports several formats, including XML, HTML, plain text, and PDF, although XPath works on XML and HTML documents only. Where possible, the written files have the same extension as the input files. PDF input, for example, is always written as text files.
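
The regular-expression mode can be pictured with a small Python sketch. This is an illustration only, not the operator's implementation: the helper name split_by_regex is hypothetical, and Python's group syntax (\1, \2) is assumed here, which may differ from the syntax the operator itself expects. Each match becomes one segment, and an optional replacement expression rewrites the match using its capture groups.

```python
import re


def split_by_regex(text, regular_expression, segment_expression=None):
    """Return one segment per regex match (hypothetical helper, not RapidMiner code)."""
    segments = []
    for match in re.finditer(regular_expression, text, flags=re.DOTALL):
        if segment_expression is None:
            # No replacement expression: the whole match is the segment.
            segments.append(match.group(0))
        else:
            # expand() substitutes capture groups such as \1 into the template.
            segments.append(match.expand(segment_expression))
    return segments


doc = '<item id="a">first</item><item id="b">second</item>'
whole_matches = split_by_regex(doc, r'<item[^>]*>.*?</item>')
rewritten = split_by_regex(doc, r'<item id="(\w+)">(.*?)</item>', r'\1: \2')
```

In this sketch, whole_matches holds the two complete item elements, while rewritten holds only the captured id and text of each element, without the surrounding tags.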

Input

  • through (File)

    The through port.

Output

  • through (File)

    The through port.

Parameters

  • preview

    Shows a preview of the results that the current configuration would produce.

  • matching_mode

    Determines which mode is used to select the segments.

  • xpath_query

    Specifies the XPath expression that matches the parts of the content to be treated as individual segments. Each match is treated as a single segment.

  • namespaces

    Specifies pairs of identifier and namespace for use in XPath queries. The (X)HTML namespace is bound automatically to the identifier h.

  • ignore_cdata

    Specifies whether CDATA should be ignored when parsing HTML.

  • assume_html

    If checked, a more tolerant XML parser is used that copes with invalid HTML constructs, but it always assumes HTML and adds missing tags. Uncheck this for plain XML.

  • regular_expression

    Specifies the regular expression that matches the substrings of the content to be treated as individual segments. Each match is treated as a single segment.

  • segment_expression

    Specifies the expression used to replace each match of the regular expression above. Matching groups can be used, for example, to extract the content of attributes without including the surrounding attributes themselves.

  • start string

    Specifies the string used as the start point in string matching. The text between the start string and the end string, both exclusive, is treated as a segment.

  • end string

    Specifies the string used as the end point in string matching. The text between the start string and the end string, both exclusive, is treated as a segment.

  • json_path_query

    Specifies the JSONPath expression that matches the parts of the content to be treated as individual segments. Each match is treated as a single segment.

  • texts

    A directory containing the documents to be segmented.

  • output

    The directory to which the segments are written.

  • use_file_extension_as_type

    If checked, the type of each file is determined by its extension. Unknown extensions are treated as text files.

  • content_type

    The content type of the input texts.

  • encoding

    The encoding used for reading or writing files.
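
The string-matching mode, in which the text between the start string and the end string (both exclusive) forms a segment, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the operator's actual code: the function name split_by_strings is hypothetical, and the sketch assumes that matching resumes after each end string and that an unterminated start string produces no segment.

```python
def split_by_strings(text, start_string, end_string):
    """Return the text between each start/end pair, delimiters excluded.

    Hypothetical helper for illustration; not part of the operator.
    """
    if not start_string or not end_string:
        raise ValueError("start and end strings must be non-empty")

    segments = []
    pos = 0
    while True:
        start = text.find(start_string, pos)
        if start == -1:
            break
        content_start = start + len(start_string)
        end = text.find(end_string, content_start)
        if end == -1:
            # Assumption: a start string without a matching end is ignored.
            break
        segments.append(text[content_start:end])
        # Continue searching after the end string just consumed.
        pos = end + len(end_string)
    return segments


segments = split_by_strings("intro BEGIN one END filler BEGIN two END", "BEGIN", "END")
```

Because both delimiters are exclusive, the segments in this sketch keep the whitespace that sat between the delimiters and the text, but never the delimiters themselves.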